[57] ABSTRACT

A surveillance system having at least one primary video 
camera for translating real images of a zone into electronic 
video signals at a first level of resolution. The system 
includes means for sampling movements of an individual or 
individuals located within the zone from the video signal 
output from at least one video camera. Video signals of
sampled movements of the individual are electronically
compared with known characteristics of movements which are
indicative of individuals having a criminal intent. The level
of criminal intent of the individual or individuals is then
determined and an appropriate alarm signal is produced.

1 Claim, 6 Drawing Sheets 
ABNORMALITY DETECTION AND
SURVEILLANCE SYSTEM

CROSS-REFERENCE TO COPENDING PATENT
APPLICATION

This is a continuation in part of patent application Ser. No.
08/367,712, filed Jan. 3, 1995, now U.S. Pat. No. 5,666,157.

BACKGROUND OF THE INVENTION 

A) Field of the Invention 

This invention generally relates to surveillance systems, 
and more particularly, to trainable surveillance systems 
which detect and respond to specific abnormal video and 
audio input signals. 

B) Background of the Invention

Today's surveillance systems vary in complexity, effi- 
ciency and accuracy. Earlier surveillance systems use sev- 
eral closed circuit cameras, each connected to a devoted 
monitor. This type of system works sufficiently well for 
low-coverage sites, i.e., areas requiring up to perhaps six
cameras. In such a system, a single person could scan the six
monitors, in "real" time, and effectively monitor the entire
(albeit small) protected area, offering a relatively high level 
of readiness to respond to an abnormal act or situation 
observed within the protected area. In this simplest of 
surveillance systems, it is left to the discretion of security 
personnel to determine, first if there is any abnormal event 
in progress within the protected area, second, the level of 
concern placed on that particular event, and third, what 
actions should be taken in response to the particular event. 
The reliability of the entire system depends on the alertness 
and efficiency of the worker observing the monitors. 

Many surveillance systems, however, require the use of a 
greater number of cameras (e.g., more than six) to police a 
larger area, such as at least every room located within a large 
museum. To adequately ensure reliable and complete sur- 
veillance within the protected area, either more personnel 
must be employed to constantly watch the additionally 
required monitors (one per camera), or fewer monitors may 
be used on a single rotation schedule wherein one monitor 
sequentially displays the output images of several cameras, 
displaying the images of each camera for perhaps a few 
seconds. In another prior art surveillance system (referred to 
as the "QUAD" system), four cameras are connected to a 
single monitor whose screen continuously and simultaneously
displays the four different images. In a "quaded
quad" prior art surveillance system, sixteen cameras are
linked to a single monitor whose screen now displays,
continuously and simultaneously, all sixteen different
images. These improvements allow fewer personnel to 
adequately supervise the monitors to cover the larger pro- 
tected area. 

These improvements, however, still require the constant 
attention of at least one person. The above described 
multiple-image/single screen systems suffered from poor
resolution and complex viewing. The reliability of the entire 
system is still dependent on the alertness and efficiency of 
the security personnel watching the monitors. The personnel 
watching the monitors are still burdened with identifying an 
abnormal act or condition shown on one of the monitors, 
determining which camera, and which corresponding zone 
of the protected area is recording the abnormal event, 
determining the level of concern placed on the particular 
event, and finally, determining the appropriate actions that 
must be taken to respond to the particular event.

Eventually, it was recognized that human personnel could
not reliably monitor the "real-time" images from one or
several cameras for long "watch" periods of time. It is
natural for any person to become bored while performing a
monotonous task, such as staring at one or several monitors
continuously, waiting for something unusual or abnormal to
occur; something which may never occur.

As discussed above, it is the human link which lowers the
overall reliability of the entire surveillance system. U.S. Pat.
No. 4,737,847 issued to Araki et al. discloses an improved
abnormality surveillance system wherein motion sensors are
positioned within a protected area to first determine the
presence of an object of interest, such as an intruder. In the
system disclosed by U.S. Pat. No. 4,737,847, zones having
prescribed "warning levels" are defined within the protected
area. The zone in which an object or person is detected, the
zones to which it moves, and the length of time the detected
object or person remains in a particular zone together determine
whether the object or person entering the zone should be
considered an abnormal event or a threat.
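
The zone-based decision described above can be sketched as follows. This is a hypothetical illustration only, not code or a formula taken from the cited patent: the zone names, the numeric warning levels, the level-times-dwell-time weighting, the threshold, and the names `ZoneVisit`, `threat_score` and `is_abnormal` are all assumptions chosen to show how zone presence and dwell time might be combined into one decision.

```python
from dataclasses import dataclass

# Hypothetical warning levels per zone (higher = more sensitive).
WARNING_LEVELS = {"lobby": 1, "corridor": 2, "vault": 5}


@dataclass
class ZoneVisit:
    zone: str        # name of the zone the object was detected in
    seconds: float   # how long the object remained in that zone


def threat_score(visits):
    """Sum of (warning level x dwell time) over all visited zones."""
    return sum(WARNING_LEVELS[v.zone] * v.seconds for v in visits)


def is_abnormal(visits, threshold=60.0):
    """Flag the path as abnormal once the score reaches the threshold."""
    return threat_score(visits) >= threshold
```

With these assumed numbers, 30 seconds in the lobby followed by 4 seconds in the vault scores 50 and is not flagged, while 15 seconds in the vault alone scores 75 and is flagged.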

The surveillance system disclosed in U.S. Pat. No. 4,737,847
does remove some of the monitoring responsibility
otherwise placed on human personnel; however, such a
system can only determine an intruder's "intent" by his 
presence relative to particular zones. The actual movements 
and sounds of the intruder are not measured or observed. A 
skilled criminal could easily determine the warning levels of 
obvious zones within a protected area and act accordingly; 
spending little time in zones having a high warning level, for 
example. 

It is therefore an object of the present invention to provide 
a surveillance system which overcomes the problems of the 
prior art.

It is another object of the invention to provide such a
surveillance system wherein a potentially abnormal event is 
determined by a computer prior to summoning a human 
supervisor. 

It is another object of the invention to provide a surveillance
system which compares specific measured movements
of a particular person or persons with a trainable, predetermined
set of "typical" movements to determine the level and
type of criminal or mischievous event.

It is another object of this invention to provide a surveillance
system which transmits the data from various sensors
to a location where it can be recorded for evidentiary
purposes. It is another object of this invention to provide
such a surveillance system which is operational day and night.

It is another object of this invention to provide a surveil- 
lance system which can cull out real-time events which 
indicate criminal intent using a weapon, by resolving the low 
temperature of the weapon relative to the higher body 
temperature and by recognizing the stances taken by the 
person with the weapon. 

It is yet another object of this invention to provide a 
surveillance system which does not require "real time" 
observation by human personnel. 

INCORPORATED BY REFERENCE 

The content of the following references is hereby incorporated
by reference.

1. Motz, L. and L. Bergstein, "Zoom Lens Systems", Journal of
Optical Society of America, 3 papers in Vol. 52, 1992.

2. D. G. Aviv, "Sensor Software Assessment of Advanced Earth
Resources Satellite Systems", ARC Inc. Report #70-80-A,
pp. 2-107 through 2-119; NASA contract NAS-1-16366.

3. Shio, A. and J. Sklansky, "Segmentation of People in Motion",
Proc. of IEEE Workshop on Visual Motion, Princeton, N.J.,
October 1991.

4. Agarwal, R. and J. Sklansky, "Estimating Optical Flow from
Clustered Trajectory Velocity Time".

5. Suzuki, S. and J. Sklansky, "Extracting Non-Rigid Moving
Objects by Temporal Edges", IEEE, 1992, Transactions of Pattern
Recognition.

6. Rabiner, L. and Biing-Hwang Juang, "Fundamentals of Speech
Recognition", Pub. Prentice Hall, 1993 (p. 434-495).

7. Weibel, A. and Kai-Fu Lee, Eds., "Readings in Speech
Recognition", Pub. Morgan Kaufmann, 1990 (p. 267-296).

8. Rabiner, L., "Speech Recognition and Speech Synthesis
Systems", Proc. IEEE, January, 1994.

SUMMARY OF THE INVENTION

A surveillance system having at least one primary video
camera for translating real images of a zone into electronic
video signals at a first level of resolution;

means for sampling movements of an individual or individuals
located within the zone from the video signal output
from at least one video camera;

means for electronically comparing the video signals of
sampled movements of the individual with known characteristics
of movements which are indicative of individuals
having a criminal intent;

means for determining the level of criminal intent of the
individual or individuals;

means for activating at least one secondary sensor and
associated recording device having a second higher level of
resolution, said activating means being in response to determining
that the individual has a predetermined level of
criminal intent.

A method for determining criminal activity by an individual
within a field of view of a video camera, said method
comprising:

sampling the movements of an individual located within
said field of view using said video camera to generate a
video signal;

electronically comparing said video signal of said individual
with known characteristics of movements that are indicative of
individuals having a criminal intent;

determining the level of criminal intent of said individual,
said determining step being dependent on said electronically
comparing step; and

generating a signal indicating a predetermined level of
criminal intent is present as determined by said determining
step.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic block diagram of the video, analysis,
control, alarm and recording subsystems embodying this
invention;

FIG. 2A illustrates a frame K of a video camera's output
of a particular environment, according to the invention,
showing four representative objects (people) A, B, C, and D,
wherein objects A, B and D are moving in a direction
indicated with arrows, and object C is not moving;

FIG. 2B illustrates a frame K+5 of the video camera's
output, according to the invention, showing objects A, B,
and D are stationary, and object C is moving;

FIG. 2C illustrates a frame K+10 of the video camera's
output, according to the invention, showing the current
location of objects A, B, C, D, and E;

FIG. 2D illustrates a frame K+11 of the video camera's
output, according to the invention, showing object B next to
object C, and object E moving to the right;

FIG. 2E illustrates a frame K+12 of the video camera's
output, according to the invention, showing a potential crime
taking place between objects B and C;

FIG. 2F illustrates a frame K+13 of the video camera's
output, according to the invention, showing objects B and C
interacting;

FIG. 2G illustrates a frame K+15 of the video camera's
output, according to the invention, showing object C moving
to the right and object B following;

FIG. 2H illustrates a frame K+16 of the video camera's
output, according to the invention, showing object C moving
away from a stationary object B;

FIG. 2I illustrates a frame K+17 of the video camera's
output, according to the invention, showing object B moving
towards object C.

FIG. 3A illustrates a frame of a video camera's output,
according to the invention, showing a "two on one" interaction
of objects (people) A, B, and C;

FIG. 3B illustrates a later frame of the video camera's
output of FIG. 3A, according to the invention, showing
[illegible in source];

FIG. 3C illustrates a later frame of the video camera's
output, according to the invention, showing
objects A and C moving in close proximity to object B;

FIG. 3D illustrates a later frame of the video camera's
output of FIG. 3C, according to the invention, showing
objects A and C quickly moving away from object B.

FIG. 4 is a schematic block diagram of a conventional
word recognition system; and

FIG. 5 is a schematic block diagram of a video and verbal
recognition system, according to the invention.

DETAILED DESCRIPTION OF THE
PREFERRED EMBODIMENTS

Referring to FIG. 1, the basic elements of one embodiment
of the invention are illustrated, including picture input
means 10, which may be any conventional electronic picture
pickup device operational within the infrared or visual
spectrum (or both) including a vidicon and a CCD/TV
camera (including the wireless type).

In another embodiment of picture input means 10, there is
deployment of a high rate camera/recorder (similar to
those made by NAC Visual Systems of Woodland Hills,
Calif., SONY and others). Such high rate camera/recorder
systems are able to detect and record very rapid movements
of body parts that are commonly indicative of a criminal
intent. Such fast movements might not be resolved with a
more standard 30 frames per second camera. However, most
movements will be resolved with a standard 30 frames per
second camera.

This picture means may also be triggered by an alert
signal from the processor of the low resolution camera or, as
before, from the audio/word recognition processor when
sensing a suspicious event.

In this first embodiment, the primary picture input means
10 is preferably a low cost video camera wherein high
resolution is not necessary and due to the relative expense
will most likely provide only moderate resolution. (The



preferred CCD/TV camera is about 1½ inches in length and
about 1 inch in diameter, weighing about 3 ounces, and for
particular deployment, a zoom lens attachment may be
used). This device will be operating continuously and will
translate the field of view ("real") images within a first
observation area into conventional video electronic signals.

In another embodiment of picture input means 10, a high
rate camera/recorder (similar to those made by NAC Visual
Systems of Woodland Hills, Calif., SONY and others) is
used, which would then enable the detection of even the very
rapid movement of body parts that are indicative of criminal
intent, and their recording. The more commonly used camera,
which operates at 30 frames per second, will be able to resolve
essentially all quick body movements.

The picture input means may also be activated by an
"alert" signal from the processor of the low resolution
camera or from the audio/word recognition processor when
sensing a suspicious event.

The picture input means for any embodiment contains a
preprocessor which normalizes a wide range of illumination
levels, especially for outside observation. The preprocessor
emulates a vertebrate's retina, which has an efficient and
accurate normalization process. One such preprocessor
(VLSI retina chip) is fabricated by the Carver Meade Laboratory
of the California Institute of Technology in Pasadena,
Calif. Use of this particular preprocessor chip will increase
the automated vision capability of this invention whenever
variation of light intensity and light reflection may otherwise
weaken the picture resolution.

The signals from the picture input means 10 are converted
into digitized signals and then sent to the picture processing
means 12.

The processor controlling each group of cameras will be
governed by an artificial intelligence system, based on
dynamic pattern recognition principles, as further described
below.

The picture processing means 12 includes an image raster
analyzer which effectively segments each image to isolate
each pair of people.

The image raster analyzer subsystem of picture processing
means 12 segments each sampled image to identify and
isolate each pair of objects (or people), and each "two on
one" group of 3 people separately.

The "2 on 1" represents a common mugging situation in
which two individuals approach a victim: one from in front
of the victim and the other from behind. The forward mugger
tells the potential victim that if he does not give up his
money (or watch, ring, etc.) the second mugger will shoot
him, stab or otherwise harm him. The group of three people
will thus be considered a potential crime in progress and will
therefore be segmented and analyzed in picture processing
means.

An additional embodiment of the picture input means 10 is the
inclusion of an optics system known as the zoom lens
system. The essentials of the zoom lens subsystem are
described in three papers written by L. Motz and L.
Bergstein, in an article titled "Zoom Lens Systems" in the
Journal of Optical Society of America, Vol. 52, April, 1992.
This article is hereby incorporated by reference.

The essence of the zoom system is to vary the focal length
such that an object being observed will be focused and
magnified at its image plane. In an automatic version of the
zoom system, once an object is in the camera's field-of-view
(FOV), the lens moves to focus the object onto the
camera's image plane. An error signal which is used to correct the
focus is generated by dividing the CCD array at the image plane into
2 halves and measuring the difference between the halves
until the object is at the center. Dividing the CCD array into
more than 2 segments, say 4 quadrants, is a way to achieve
automatic centering, as is the case with mono-pulse radar.
Regardless of the number of segments, the error signal is
used to generate the desired tracking of the object.

In a wide field-of-view (WFOV) operation, there may be
more than one object, thus special attention is given to the
design of the zoom system and its associated software and
firmware control. Assuming 3 objects, as is the "2 on 1"
potential mugging threat described above, and that the 3
persons are all in one plane, one can program a shifting from
one object to the next, from one face to another face, in a
prescribed sequential order. Moreover, as the objects move
within the WFOV they will be automatically tracked in
azimuth and elevation. In principle, the zoom would focus on
the nearest object, assuming that the amount of light on
each object is the same, so that the prescribed sequence
starting from the nearest object will proceed to the remaining
objects.

If the objects are in different
planes but still within the camera's WFOV, the zoom, with
input from the segmentation subsystem of the picture analysis
means 12, will focus on the object closest to the right hand
side of the image plane, and then proceed to move the focus
to the left, focusing on the next object and on the next
sequentially.

In all of the above cases, the automatic zoom can more
naturally choose to home-in on the person with the brightest
emission or reflection, and then proceed to the next brightest
and so forth. This would be a form of an intensity/time
selection multiplex zoom system.

The relative positioning of the input camera with respect
to the area under surveillance will affect the accuracy by
which [illegible in source]. In
this preferred embodiment, it is beneficial for the input
camera to view the area under surveillance from a point
located directly above, e.g., with the input camera mounted
high on a wall, a utility tower, or a traffic light support tower.
The height of the input camera is preferably sufficient to
minimize occlusion between the input camera and the movement
of the individuals under surveillance.

Once the objects within each sampled video frame are
segmented (i.e., detected and isolated), an analysis is made
of the detailed movements of each object located within
each particular segment of each image, and their relative
movements with respect to the other objects.

Each image frame segment, once digitized, is stored in a
frame by frame memory storage of section 12. Each frame
from the camera input 10 is subtracted from a previous
frame already stored in memory 12 using any conventional
differencing process. The differencing process involving
multiple differencing steps takes place in the differencing
section 12. The resulting difference signal (outputted from
the differencing sub-section 12) of each image indicates all
the changes that have occurred from one frame to the next.
These changes include any movements of the individuals
located within the segment and any movements of their
limbs, e.g., arms.

A collection of differencing signals for each moved object
of subsequent sampled frames of images (called a "track")
allows a determination of the type, speed and direction
(vector) of each motion involved and also processing which
will extract acceleration, i.e., rate of change of velocity; and
change in acceleration with respect to time (called
"jerkiness"), for correlation with stored signatures
of known physical criminal acts. For example, subsequent
differencing signals may reveal that an individual's arm is
moving to a high position (such as the upper limit of that
arm's motion, i.e., above his head) at a fast speed. This
particular movement could be perceived, as described
below, as a hostile movement with a possible criminal intent
requiring the expert analysis of security personnel.

The intersection of two tracks indicates the intersection of
two moved objects. The intersecting objects, in this case,
could be merely the two hands of two people greeting each
other, or depending on other characteristics, as described
below, the intersecting objects could be interpreted as a fist
of an assailant contacting the face of a victim in a less
friendly greeting. In any event, the intersection of two tracks
immediately requires further analysis and/or the summoning
of security personnel. But the generation of an alarm by light
and sound devices located, for example, on a monitor will
turn a guard's attention only to that monitor, hence the labor
savings. In general, however, friendly interactions between
individuals are a much slower physical process than is a
physical assault vis-a-vis body parts of the individuals
involved. Hence, friendly interactions may be easily distinguished
from hostile physical acts using current low pass
and high pass filters, and current pattern recognition techniques
based on experimental reference data.

When a large number of sensors are distributed over a
large number of facilities, for example, a number of ATMs
(automatic teller machines), associated with particular bank
branches and in a particular state or states and all operated
under a single bank network control on a time division
multiplexed basis, then only a single monitor is required.

A commercially available software tool may enhance
object-movement analysis between frames (called optical
flow computation) (see ref. 3 and 4). With optical flow
computation, specific (usually bright) reflective elements,
called farkles, emitted from the clothing and/or the body
parts of an individual of one frame are subtracted from a
previous frame. The bright portions will inherently provide
sharper detail and therefore will yield more accurate data
regarding the velocities of the relative moving objects.
Additional computation, as described below, will provide
data regarding the acceleration and even change in acceleration
or "jerkiness" of each moving part sampled.

The physical motions of the individuals involved in an
interaction will be detected by first determining the edges of
each person imaged. The movements of the body
parts will then be observed by noting the movements of the
edges of the body parts of the (2 or 3) individuals involved
in the interaction.

The differencing process will enable the determination of
the velocity and acceleration and rate of acceleration of
those body parts.

The now processed signal is sent to comparison means 14
which compares selected frames of the video signals from
the picture input means 10 with "signature" video signals
stored in memory 16. The signature signals are representative
of various positions and movements of the body parts of
an individual having various levels of criminal intent. The
method for obtaining the data base of these signature video
signals in accordance with another aspect of the invention is
described in greater detail below.

If a comparison is made positive with one or more of the
signature video signals, an output "alert" signal is sent from
the comparison means 14 to a controller 18. The controller
18 controls the operation of a secondary, high resolution
picture input means (video camera) 20 and a conventional
monitor 22 and video recorder 24. The field of view of the
secondary camera 20 is preferably, at most, the same as the
field of view of the primary camera 10, surveying a second
observation area. The recorder 24 may be located at the site
and/or at both a law enforcement facility (not shown) and
simultaneously at a Court office or legal facility to prevent
loss of incriminating information due to tampering.

The purpose of the secondary camera 20 is to provide a
detailed video signal of the individual having assumed
criminal intent and also to improve false positive and false
negative performance. This information is recorded by the
video recorder 24 and displayed on a monitor 22. An alarm
bell or light (not shown) or both may be provided and
activated by an output signal from the controller 18 to
summon a supervisor to immediately view the pertinent
video images showing the apparent crime in progress and
assess its accuracy.

In another embodiment of the invention, a VCR 26 is
operating continuously (using a 6 hour loop-tape, for
example). The VCR 26 is being controlled by the VCR
controller 28. All the "real-time" images directly from the
picture input means 10 are immediately recorded and stored
for at least 6 hours, for example. Should it be determined
that a crime is in progress, a signal from the controller 18 is
sent to the VCR controller 28 changing the mode of recording
from tape looping mode to non-looping mode. Once the
VCR 26 is changed to a non-looping mode, the tape will not
re-loop and will therefore retain the perhaps vital recorded
video information of the surveyed site, including the crime
itself, and the events leading up to the crime.

When the non-looping mode is initiated, the video signal
may also be transmitted to a VCR located elsewhere; for
example, at a law enforcement facility and, simultaneously,
to other secure locations of the Court and its associated
offices.

Prior to the video signals being compared with the "signature"
signals stored in memory, each sampled frame of
video is "segmented" into parts relating to the objects
detected therein. To segment a video signal, the video signal
derived from the vidicon or CCD/TV camera is analyzed by
an image raster analyzer. Although this process causes slight
signal delays, it is accomplished nearly in real time.

At certain sites, or in certain situations, a high resolution
camera may not be required or otherwise used. For example,
the resolution provided by a relatively simple and low cost
camera may be sufficient. Depending on the level of security
for the particular location being surveyed, and the time of
day, the interval between analyzed frames
may vary. For example, in a high risk area, every frame from
the CCD/TV camera may be analyzed continuously to
ensure that the maximum amount of information is recorded
prior to and during a crime. In a low risk area, it may be
preferred to sample perhaps every 10 frames from each
camera, sequentially. If, during such a sampling, it is determined
that an abnormal or suspicious event is occurring,
such as two people moving very close to each other, then the
system would activate an alert mode wherein the system
becomes "concerned and curious" in the suspicious actions
and the sampling rate is increased to perhaps every 5 frames
or even every frame. As described in greater detail below,
depending on the type of system employed (i.e., video only,
audio only or both), during such an alert mode, the entire
system may be activated wherein both audio and video
system begin to sample the environment for sufficient information
to determine the intent of the actions.
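
The alert-mode sampling described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the function names `sampling_interval` and `frames_to_analyze`, the `risk` labels, and the per-frame `suspicious` flags are hypothetical. Only the intervals themselves (every frame in a high risk area, every 10 frames in a low risk area, every 5 frames once an alert is raised) are taken from the description.

```python
def sampling_interval(risk, alert):
    """Return how many frames to advance between analyzed frames."""
    if risk == "high":
        return 1          # high risk area: analyze every frame
    if alert:
        return 5          # alert mode: sample more often
    return 10             # low risk, no alert: every 10th frame


def frames_to_analyze(suspicious, risk="low"):
    """Yield the indices of the frames that would be analyzed.

    `suspicious` holds one boolean per frame and stands in for the
    outcome of analyzing that frame (an assumption of this sketch).
    """
    index = 0
    alert = False
    while index < len(suspicious):
        yield index
        # Once a sampled frame looks suspicious, stay in alert mode.
        alert = alert or suspicious[index]
        index += sampling_interval(risk, alert)
```

For a 40-frame low risk sequence in which only frame 10 is suspicious, the sketch analyzes frames 0 and 10 at the slow rate and then switches to every fifth frame. A real system would also need a rule for leaving alert mode, which the description does not specify.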
Referring to FIG. 2, several frames of a particular camera
output are shown to illustrate the segmentation process
performed in accordance with the invention. The system
begins to sample at frame K and determines that there are
four objects (previously determined to be people, as
described below), A-D, located within a particular zone being
policed. Since nothing unusual is determined from the initial
analysis, the system does not warrant an "alert" status.
People A, B, and D are moving according to normal,
non-criminal intent, as could be observed.

A crime likelihood is indicated when frames K+10
through K+13 are analyzed by the differencing process. And
if the movement of the body parts indicates velocity, acceleration
and "jerkiness" that compare positively with the
stored digital signals depicting movements of known criminal
physical assaults, it is likely that a crime is in progress.

Additionally, if a large velocity of departure is indicated
when person C moves away from person B, as indicated in
frames K+15 through K+17, a larger level of confidence is
attained in deciding that a physical criminal act has taken
place or is about to.

An alarm is generated the instant any of the above
conditions is established. This alarm condition will result in
sending in Police or Guards to the crime site, activating the
high resolution CCD/TV camera to record the face of the
person committing the assault, a loud speaker being activated
automatically, playing a recorded announcement
warning the perpetrator of the seriousness of his actions now
being undertaken and demanding that he cease the criminal
act. After dark a strong light will be turned on automatically.
The automated responses will be actuated the instant an
alarm condition is adjudicated by the processor.

Furthermore, an alarm signal is sent to the police station and
the same video signal of the event is transmitted to a court
appointed data collection office, to the Public Defender's
office and the District Attorney's Office.

As described above, it is necessary to compare the resulting
signature of physical body parts motion involved in a
physical criminal act, that is expressed by specific motion
characteristics (i.e., velocity, acceleration, change of
acceleration), with a set of signature files of physical criminal
acts, in which body parts motion is equally involved.
This comparison is commonly referred to as pattern matching
and is part of the pattern recognition process.

adequately registered with the video sensor. The "Tags"
sensed in video are assumed friendly and authorized. This
information will simplify the segmentation process.

A light connected to each RF-ID card will be turned ON
when a positive response to an interrogation signal is
established. The light will appear on the computer generated
[illegible in source] (also on the screen of the monitor) and the
intersection of tracks clearly indicated, followed by their physical
interaction. But also noted will be the intersection between the
tagged and the untagged individuals. In all of such cases, the
segmentation process will be simpler.

There are many manufacturers of RF-ID cards and
Interrogators; three major ones are The David Sarnoff
Research Center of Princeton, N.J., AMTECH of Dallas,
Tex. and MICRON Technology of Boise, Id.

The applications of the present invention include stationary
facilities: banks and ATMs, hotels, private residence
halls and dormitories, high rise and low rise office and
residential buildings, public and private schools from kindergarten
through high-school, colleges and universities,
hospitals, sidewalks, street crossings, parks, containers and
container loading areas, shipping piers, train stations, truck
loading stations, airport passenger and freight facilities, bus
stations, subway stations, movie houses, theaters, concert
halls and arenas, sport arenas, libraries, churches, museums,
stores, shopping malls, restaurants, convenience stores, bars,
coffee shops, gasoline stations, highway rest stops, tunnels,
bridges, gateways, sections of highways, toll booths,
warehouses, and depots, factories and assembly rooms, law
enforcement facilities including jails.

Further applications of the invention include areas of
moving platforms: automobiles, trucks, buses, subway cars,
train cars, freight and passenger, boats and ships (passenger
and freight), tankers, service vehicles, construction vehicles,
on and off-road, containers and their carriers, and airplanes,
and also in military applications that will include but will
not be limited to assorted military ground, sea, and air
mobile vehicles and platforms as well as stationary facilities
where the protection of low, medium, and high value targets
is necessary; such targets are common in the military but
have equivalents in the civilian areas wherein this invention
will serve both sectors.

As a deterrence to car-jacking a tiny CCD/TV camera
connected surreptitiously at the ceiling of the car, or in the

The files of physical criminal acts, which involve body rear- view mirror, through a pin hole lens and focused at the 

parts movements such as hands, arms, elbows, shoulder. driver's seat, will be connected to the video processor to 

head, torso, legs, and feet we obtained, a priority, by record the face of the drive. The camera is triggered by the 

experiments and simulations of physical criminal acts gath- 50 automatic word recognition processor that will identify the 

ered from "dramas" that are enacted by professional actors. well known expressions commonly used by the car-jacker, 

the data gathered from experienced muggers who have been The video picture will be recorded and then transmitted via 

caught by the police as well as victims who have reported cellular phone in the car. Without a phone, the short video 

details of their experiences will help the actors perform recording of the face of the car-jacker will be held until the 

accurately. Video of their motions involved in these simu- 55 car is found by the police, but now with the evidence (the 

lated acts will be stored in digitized form and files prepared picture of the car-jacker) in hand. 

for each of the body parts involved, in the simulated physical in this present surveillance system, the security pa-sonnel 

criminal acts. manning the monitors are alerted only to video images 

The present invention could be easily implemented at which show suspicious actions (criminal activities) within a 

various sites to create effective "Crime Free" zones. In 60 prescribed observation zone. The security personnel are 

another embodiment, the above described Abnormality therefore used to access the accuracy of the crime and 

Detection System includes an RF-ID (Radio Frequency determine the necessary actions for an appropriate response. 

Identification) tag, to assist in the detection and tracking of By using computers to effectively filter out all normal and 

individuals within the field of view of a camera. noncriminal video signals from observation areas, fewer 

LD. cards or tags are worn by authorized individuals. The 65 security personnel are required to survey and *'secure" a 

tags response when queried by the RF Interrogator. The greater overall area (including a greater number of obser- 

response signal of the tags propagation pattern which is vation areas, i.e.. cameras). 
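
The pattern matching of motion signatures described earlier in this section (velocity, acceleration, and change of acceleration compared against stored signature files) can be illustrated with a minimal sketch. The specification does not name a similarity measure; normalized correlation is assumed here purely for illustration, and the names `CHARACTERISTICS`, `normalized_correlation`, and `best_signature_match`, along with the sample values, are hypothetical.

```python
import math

# Motion characteristics named in the text; "jerk" stands for change of acceleration.
CHARACTERISTICS = ("velocity", "acceleration", "jerk")


def normalized_correlation(a, b):
    """Cosine similarity between two equal-length sample sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a flat signal carries no shape information
    return dot / (norm_a * norm_b)


def best_signature_match(observed, signature_files):
    """Return (name, score) of the stored signature most similar to `observed`.

    `observed` and every stored signature map each characteristic to a
    list of samples. The score is the mean similarity over all three
    characteristics (an assumed scoring rule, not the patent's).
    """
    best_name, best_score = None, -1.0
    for name, signature in signature_files.items():
        scores = [
            normalized_correlation(observed[key], signature[key])
            for key in CHARACTERISTICS
        ]
        score = sum(scores) / len(scores)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


# Hypothetical signature files for one body part (illustrative numbers only).
signature_files = {
    "punch": {"velocity": [0, 2, 5, 1], "acceleration": [2, 3, -4, -1], "jerk": [1, -7, 3, 0]},
    "handshake": {"velocity": [1, 1, 1, 1], "acceleration": [0, 0, 0, 0], "jerk": [0, 0, 0, 0]},
}
observed = {"velocity": [0, 2.1, 4.9, 1], "acceleration": [2, 3, -4, -1], "jerk": [1, -7, 3, 0]}
name, score = best_signature_match(observed, signature_files)
```

A deployed system would still need a decision threshold on the returned score; that value is not given in the text and is left out here.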
It is also contemplated that the present system could be
applied to assist blind people "see". A battery operated
portable version of the video system would automatically
identify known objects in its field of view and a speech
synthesizer would "say" the object. For example, "chair",
"table", etc. would indicate the presence of a chair and a
table.

Depending on the area to be policed, it is preferable that at
least two and perhaps three cameras (or video sensors) are used
simultaneously to cover the area. Should one camera sense a
first level of criminal action, the other two could be
manipulated to provide a three dimensional perspective coverage
of the action. The three dimensional image of a physical
interaction in the policed area would allow observation of a
greater number of details associated with the steps: accost,
threat, assault, response and post response. The conversion
from the two dimensional image to the three dimensional image
is known as "random transform".

In the extended operation phase of the invention, more details
of the physical variation of movement characteristics of
physical threats and assaults against a victim, together with
the speaker independent (male, female of different age groups)
and dialect independent words and terse sentences, with
corresponding responses, will enable automatic recognition of a
criminal assault without the need of a guard, unless required
by statutes and other external requirements.

In another embodiment of the present invention, both video and
acoustic information is sampled and analyzed. The acoustic
information is sampled and analyzed in a similar manner to the
sampling and analyzing of the above-described video
information. The audio information is sampled and analyzed in
a manner shown in FIG. 4, and is based on prior art (references
6 and 7).

The employment of the audio speech band, with its associated
Automatic Speech Recognition (ASR) system, will not only reduce
the false alarm rate resulting from the video analysis, but can
also be used to trigger the video and other sensors if the
sound threat predates the observed threat.

Referring to FIG. 4, a conventional automatic word recognition
system is shown, including an input microphone system 40, an
analysis subsystem 42, a template subsystem 44, a pattern
comparator 46, and a post-processor and decision logic
subsystem 48.

In operation, upon activation, the acoustic/audio policing
system will begin sampling all (or a selected portion) of
nearby acoustic signals. The acoustic signals will include
voices and background noise. The background noise signals are
generally known and predictable, and may therefore be easily
filtered out using conventional filtering techniques. Among the
expected noise signals are unfamiliar speech, automotive
related sounds, honking, sirens, the sound of wind and/or rain.

The microphone input system 40 picks up the acoustic signals
and immediately filters out the predictable background noise
signals and amplifies the remaining recognizable acoustic
signals. The filtered acoustic signals are analyzed in the
analysis subsystem 42 which processes the signals by means of
digital and spectral analysis techniques. The output of the
analysis subsystem is compared in the pattern comparator
subsystem 46 with selected predetermined words stored in
memory in 44. The post processing and decision logic subsystem
48 generates an alarm signal, as described below.

The templates 44 include perhaps about 100 brief and easily
recognizable terse expressions, some of which are single words,
and are commonly used by those intent on a criminal act. Some
examples of commonly used word phrases spoken by a criminal to
a victim prior to a mugging, for example, include: "Give me
your money", "This is a stick-up", "Give me your wallet and you
won't get hurt" . . . etc. Furthermore, commonly used replies
from a typical victim during such a mugging may also be stored
as template words, such as "help", and certain sounds such as
shrieks, screams and groans, etc.

The templates, with which inputted acoustic sounds are
compared, must be chosen carefully, taking into account the
particular accents and slang of the language spoken in the
region of concern (e.g., the southern U.S. will require a
different template 44 than the one used for a recognition
system in the New York City region of the U.S.).

The output of the word recognition system shown in FIG. 4 is
used as a trigger signal to activate a sound recorder, or a
camera used elsewhere in the invention, as described below.

The preferred microphone used in the microphone input
subsystem 40 is a shotgun microphone, such as those
commercially available from the Sennheiser Company of
Frankfurt, Germany. These microphones have a super-cardioid
propagation pattern. However, the gain of the pattern may be
too small for high traffic areas and may therefore require more
than one microphone in an array configuration to adequately
focus and track in these areas. The propagation pattern of the
microphone system enables better focusing on a moving sound
source (e.g., a person walking and talking). A conventional
directional microphone may also be used in place of a shot-gun
type microphone, such as those made by the Sony Corporation of
Tokyo, Japan. Such directional microphones will achieve similar
gain to the shot-gun microphones, but with a smaller physical
structure.

A feedback loop circuit (not specifically shown) originating in
the post processing subsystem 48 will direct the microphone
system to track a particular dynamic source of sound within the
area surveyed by video cameras.

An override signal from the video portion of the present
invention will activate and direct the microphone system
towards the direction of the field of view of the camera. In
other words, should the video system detect a potential crime
in progress, the video system will control the audio recording
system towards the scene of interest. Likewise, should the
audio system detect words of an aggressive nature, as described
above, the audio system will direct appropriate video cameras
to visually cover and record the apparent source of the sound.

A number of companies have developed very accurate and
efficient, speaker independent word recognition systems based
on a hidden Markov model (HMM) in combination with an
artificial neural network (ANN). These companies include IBM of
Armonk, N.Y., AT&T Bell Laboratories, Kurzweil of Cambridge,
Mass. and Lernout and Hauspie of Belgium.

Put briefly, the HMM system uses probability statistics to
predict a particular spoken word following recognition of a
primary word unit, syllable or phoneme. For example, as the
word "money" is inputted into an HMM word recognition system,
the first recognized portion of the word is "mon . . .". The
HMM system immediately recognizes this word stem and determines
that the spoken word could be "MONday", "MONopoly", or
"MONey", etc. The resulting list of potential words is
considerably shorter than the entire list of all spoken words
of the English language.
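
The narrowing effect of recognizing a word stem such as "mon" can be illustrated with a deliberately simplified sketch. This is not a hidden Markov model: it is a plain prefix filter over a small vocabulary with assumed usage counts, shown only to make the shrinking candidate list concrete. The function name `candidates_for_stem`, the vocabulary, and its counts are hypothetical.

```python
def candidates_for_stem(stem, vocabulary):
    """Return (word, probability) pairs for words that begin with `stem`.

    `vocabulary` maps each word to an assumed usage count. Probabilities
    are renormalized over the matching words only, so they sum to 1.
    """
    stem = stem.lower()
    matches = {
        word: count
        for word, count in vocabulary.items()
        if word.lower().startswith(stem)
    }
    total = sum(matches.values())
    if total == 0:
        return []  # no word in the vocabulary starts with this stem
    ranked = sorted(matches.items(), key=lambda item: item[1], reverse=True)
    return [(word, count / total) for word, count in ranked]


# Hypothetical template vocabulary with illustrative counts.
vocabulary = {"money": 50, "monday": 30, "monopoly": 20, "wallet": 40, "help": 60}
shortlist = candidates_for_stem("mon", vocabulary)
```

With this vocabulary, the stem "mon" leaves three candidates out of five words; a real recognizer would refine the ranking as further phonemes arrive.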
Therefore, the HMM system employed with the present invention
allows both the audio and video systems to operate quickly and
use HMM probability statistics to predict future movements or
words based on an early recognition of initial movements and
word stems.

The HMM system may be equally employed in the video
recognition system. For example, if a person's arm quickly
moves above his head, the HMM system may determine that there
is a high probability that the arm will quickly come down,
perhaps indicating a criminal intent.

The above-described system actively compares input data
signals from a video camera, for example, with known reference
data of specific body movements stored in memory. In accordance
with the invention, a method of obtaining the "reference data"
(or ground truth data) is described. This reference data
describes threats, actual criminal physical acts, verbal
threats and verbal assaults, and also friendly physical acts
and friendly words, and neutral interactions between
interacting people.

According to the invention, the reference data may be obtained
using any of at least the following described three methods
including a) attaching accelerometers at predetermined points
(for example arm and leg joints, hips, and the forehead) of
actors; b) using a computer to derive 3-D models of people
(stored in the computer's memory as pixel data) and analyze the
body part movements of the people; and c) scanning (or
otherwise downloading) video data from movie and TV clips of
various physical and verbal interactions into a computer to
analyze specific movements and sounds.

While the above-identified three approaches should yield
similar results, the preferred method for obtaining reference
data includes attaching accelerometers to actors while
performing various actions or "events" of interest: abnormal
(e.g., criminal or generally quick, violent movements), normal
(e.g., shaking hands, slow and smooth movements), and neutral
behavior (e.g., walking).

In certain environments, in particular where many people are
moving in different directions, such as during rush hour in the
concourse of Grand Central Station or in Central Park, both
located in New York City, it may prove very difficult to
analyze the specific movements of each person located within
the field of view of a surveillance camera. To overcome the
analyzing burden in these environments, according to another
embodiment of the invention, the people located within the
environment are provided personal ID cards that include an
electronic radio frequency (rf) transmitter. The transmitter of
each radio-frequency identification card (RFID) transmits an rf
signal that identifies the person carrying the card. Receivers
located in the area of a surveillance camera can receive the
identification information and use it to help identify the
different people located within the field of the nearby
surveillance camera (or microphone, in the case of audio
analysis). In one possible arrangement, people may be issued an
RFID card prior to entering a particular area, such as a U.S.
Tennis Open event. In such instance, a clearance check would be
made for each person prior to them receiving such a card. Once
within the secure area, surveillance cameras would associate
card-holders as less likely to cause trouble and would be
suspicious of anyone within the field of the camera's view not
being identified by an RFID card.

As described above, the basic configuration of the invention
(as shown in FIGS. 1 and 2) uses video and audio sensors (such
as, respectively, a camera and a microphone), and potentially
other active and passive sensing and processing devices and
systems (including the use of radar and ladar and other devices
that operate in all areas of the electromagnetic spectrum) to
detect threats and actual criminal acts occurring within a
field of view of a camera (a video sensor). The system
described above, and according to the invention, initially
requires the collection of "reference data" that correspond to
specific known acts of threat, actual assaults (both physical
and verbal), and other physical interactions that are
considered friendly or neutral. Video components of recorded
"reference data" are stored in a physical movement dictionary
(or data base), while audio components of such reference data
are stored in a verbal utterance dictionary (or data base).

In operation of the earlier described system, real time (or
"fresh") data is inputted into the system through one sensor
(such as a video camera) and immediately compared to the
reference data stored in either or both data bases. As
described above, a decision is made based on a predetermined
algorithm. If it is determined that the fresh input data
compares closely with a known hostile action or threat, an
alarm is activated to summon law enforcement. Simultaneously, a
recording device is activated to record the hostile event in
real time.

The above-described reference data is preferably obtained
through the use of actors performing specific movements of
hostility, threats, and friendly and neutral actions and other
actors performing neutral actions of greetings and also
simulating a victim's response to acts of aggression, hostility
and friendship. According to the invention, accelerometers are
connected to specific points of the actors' bodies. Depending
on the particular actions being performed by the actors, the
accelerometers may be attached to various parts of their
bodies, such as the hands, lower arms, elbows, upper arms,
shoulders, top of each foot, the lower leg and thigh, the neck
and head. Of course other parts of the actors' bodies may
similarly support an accelerometer, and some of the ones
mentioned above may not be needed to record a particular
action.

The accelerometers may be attached to the particular body joint
or location using a suitable tape or adhesive and may further
include a transmitter chip that transmits a signal to a
multi-channel receiver located nearby, and a selected
electronic filter that helps minimize transmission
interference. Alternatively, all accelerometers or a selected
group may be hard wired on the actor's body and interconnected
to a local master receiver. The data derived from each
accelerometer as the actor performs and moves his/her body
includes the instantaneous acceleration of the particular body
part, the change of acceleration (the jerkiness of the
movement), and, through integration processing, the velocity
and position at any given time. These signals (collectively
called "JAVP") are processed by known mathematical operators:
FFT (fast Fourier transform), cosine transform or wavelets, and
then stored in a matrix format for comparison with the same
processed "fresh" data, as described above. The JAVP data is
collectively placed into a data base (image dictionary). The
image dictionary includes signatures of the threat and actual
assault movements of the attacker and of the response movement
of the victim, paying particular attention to the movements of
the attacker.

In making the "reference data", the weight or size of each
actor is preferably taken into account. For example, ten actors
representing attackers preferably vary in weight (or size) from
220 lbs. to 110 lbs. with commonly associated heights.
Similarly, ten actors representing victims are selected. The
twenty actors then perform a number (perhaps 100) of
choreographed skits or actions that factor the size difference
between an attacker and a victim according to the movement of
the body part, acceleration, change of acceleration, and
velocity for hostile, friendly, and neutral acts. An example of
a neutral act may be two people merely walking past each other
without interaction.
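
The derivation of the JAVP signals (jerk, acceleration, velocity, position) from a single accelerometer channel can be sketched as follows. This is a minimal illustration under stated assumptions: samples are uniformly spaced, jerk is taken as a backward finite difference, and velocity and position are obtained by trapezoidal integration. The specification does not prescribe these numerical methods, and the function name `derive_javp` and its parameters are hypothetical. The subsequent transform step (FFT, cosine transform, or wavelets) is not shown.

```python
def derive_javp(acceleration, dt, v0=0.0, p0=0.0):
    """Derive jerk, velocity and position from uniformly sampled acceleration.

    acceleration: list of samples for one body part along one axis
    dt:           sampling interval in seconds (assumed constant)
    v0, p0:       assumed initial velocity and position
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    n = len(acceleration)
    if n == 0:
        return {"jerk": [], "acceleration": [], "velocity": [], "position": []}

    # Jerk: backward difference of acceleration (first sample set to 0).
    jerk = [0.0] * n
    for i in range(1, n):
        jerk[i] = (acceleration[i] - acceleration[i - 1]) / dt

    # Velocity: trapezoidal integration of acceleration.
    velocity = [v0]
    for i in range(1, n):
        step = 0.5 * (acceleration[i - 1] + acceleration[i]) * dt
        velocity.append(velocity[-1] + step)

    # Position: trapezoidal integration of velocity.
    position = [p0]
    for i in range(1, n):
        step = 0.5 * (velocity[i - 1] + velocity[i]) * dt
        position.append(position[-1] + step)

    return {
        "jerk": jerk,
        "acceleration": list(acceleration),
        "velocity": velocity,
        "position": position,
    }


# Constant acceleration of 2 units sampled every 0.5 s (illustrative values).
javp = derive_javp([2.0, 2.0, 2.0, 2.0, 2.0], dt=0.5)
```

In practice, integrating accelerometer output accumulates drift, so a real system would likely need filtering or periodic re-referencing; the text does not address this.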

Once an initial set of JAVP data is generated through the 
use of actors carrying accelerometers, as described above,
further JAVP data may be generated simply by recording 
actors performing specific actions using a conventional 
video sensor (such as a video camera). In this case, the same 
physical acts involved in the same skits or performances are 
carried out by the actor aggressors and actor victims, but are 
simply recorded by a video camera, for example. The JAVP 
data is transformed using only image processing techniques. 
A matrix format memory is again generated using the JAVP 
data and compared to each of the corresponding body part 
signatures derived using the accelerometers as in the above- 
described case. In doing this, similarities and the closeness 
of the signatures of each body part for each type of move-
ment may be categorized: hostile (upper cut, kicking, drawing
a knife, etc.), friendly (shaking hands, waving, etc.), and
neutral (walking past each other or standing in a line). 
Modifications may be made to each of these signatures in 
order to obtain more accurate reference signatures, accord- 
ing to people of different size and weight. 

If the difference between the video-only JAVP data and
the accelerometer JAVP data is more than a predetermined
amount, the performances by the actors would be repeated
until the difference between the two signatures is understood
(by the actors) and corrections are made.
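
The check that decides whether a performance must be repeated can be sketched as below. The specification only says the difference must not exceed "a predetermined amount"; the relative RMS measure and the 0.2 default used here are assumptions for illustration, and the function names are hypothetical.

```python
import math


def relative_rms_difference(accel_signature, video_signature):
    """RMS difference between two signatures, relative to the accelerometer one."""
    if len(accel_signature) != len(video_signature) or not accel_signature:
        raise ValueError("signatures must be non-empty and of equal length")
    diff_sq = sum((a - v) ** 2 for a, v in zip(accel_signature, video_signature))
    ref_sq = sum(a * a for a in accel_signature)
    if ref_sq == 0.0:
        return 0.0 if diff_sq == 0.0 else math.inf
    return math.sqrt(diff_sq / ref_sq)


def needs_repeat(accel_signature, video_signature, max_difference=0.2):
    """True when the video-only signature deviates too far from the reference.

    `max_difference` stands in for the "predetermined amount" (assumed value).
    """
    return relative_rms_difference(accel_signature, video_signature) > max_difference


# Illustrative signatures for one body part and one skit.
repeat = needs_repeat([3.0, 4.0], [3.0, 4.5])
```

Here the relative difference is 0.1, below the assumed limit, so the performance would not need to be repeated.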

The difference between the accelerometer and video sensor
signatures, based on input of the same physical movements,
bounds the range of incremental change for the reference
signatures.

Typically accompanying each of the hostile, friendly, and 
neutral acts performed by the actors, spoken words and 
expressions are verbalized by the attacker and by the victim.
This audio-detection system includes a word-spotting/
recognition and word gisting system, according to the
invention, which analyzes specific words, inflections,
accents, and dialects and detects spoken words and expres-
sions that indicate hostile actions, friendly actions, or neutral
ones.

The audio-detection system uses a shotgun-type micro-
phone or a microphone array to achieve a high gain propa-
gation pattern and further preferably employs appropriate
noise reduction systems and common mode rejection cir-
cuitry to achieve good audio detection of the words and oral
expressions provided by the attacker and the victim.

Word recognition and word gisting software engines are 
commercially available which may easily handle the rela- 
tively few words and expressions typically used during such 
a hostile interaction. The attacker's and the victim's reference
words and word gisting of a hostile nature are stored in a
verbal dictionary, as are those of friendly and neutral inter-
actions.

Referring to FIG. 5, in operation, according to this 
embodiment of the invention, physical movements and 
verbal utterances of people in a field of view of an area under 
surveillance are recorded by an appropriate video camera 
and microphone. Image data from the camera is processed
(e.g., filtered), as described above, and compared to image
data stored within the reference image dictionary, which is
compiled in a manner described above. Similarly, audio
information from the microphone is processed (filtered) and
compared with known verbal utterances from the reference
verbal dictionary, which is compiled in a manner described
above.

If either an image or a verbal utterance matches (to a 
predetermined degree) a known image or verbal utterance of 
hostility, then an alarm is activated and recording equipment
is turned on. 
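
The decision rule in the preceding paragraph (a match on either channel triggers both the alarm and the recording) can be sketched as below. The scores are assumed to be match degrees between 0 and 1 against hostile entries of the image and verbal dictionaries; the 0.8 threshold stands in for the "predetermined degree" and is an assumption, as are the names `Response` and `decide`.

```python
from dataclasses import dataclass


@dataclass
class Response:
    alarm: bool   # summon law enforcement
    record: bool  # turn on recording equipment


def decide(image_score, verbal_score, threshold=0.8):
    """Activate alarm and recording if either channel matches a hostile entry.

    image_score / verbal_score: best match degree against hostile entries
    in the image and verbal dictionaries (assumed to lie in 0..1).
    threshold: assumed stand-in for the "predetermined degree".
    """
    hostile = image_score >= threshold or verbal_score >= threshold
    return Response(alarm=hostile, record=hostile)


# A strong verbal match alone is enough to trigger the response.
response = decide(image_score=0.1, verbal_score=0.85)
```

Using a logical OR keeps either sensor able to raise the alarm on its own, matching the "either an image or a verbal utterance" wording.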

An alternate approach to the above-described accelerometer
technique for obtaining the reference JAVP signals
associated with hostile, friendly and neutral actions is to
employ doppler radar, operating at very short wavelengths,
imaging radar (actually an inverse synthetic aperture radar),
also operating at very short wavelengths, or laser radar. It is
preferred that these active devices be operated at very low
power to prevent undesirable exposure of transmitted
energy to the people located within an area of transmission.
Among the benefits of using any of the above-listed active
sensors is their ability to detect and analyze movements of
selected body parts at a distance, in darkness (e.g., at night),
and depending on the range, through inclement weather.
What is claimed is: 

1. A method for determining criminal activity by an 
individual within a field of view of a video camera, said 
method comprising: 

sampling the movements of an individual located within
said field of view using said video camera to generate
a video signal;
electronically comparing said video signal of said video
camera with known characteristics of movements that
are indicative of an individual having criminal intent;
determining the level of criminal intent of said individual,
said determining step being dependent on said elec-
tronically comparing step; and
generating a signal indicating that a predetermined level
of criminal intent is present as determined by said
determining step.



